Every learner must submit their own homework solutions. You may discuss the homework with each other, but everyone must write up and submit their own solution; you may not copy someone else's solution.
The homework consists of two parts:
Follow the prompts in the attached Jupyter notebook. Download the data (the same as for the previous assignment) and place it in your working directory, or modify the path to upload it to your notebook. Add markdown cells to your analysis to include your solutions, comments, and answers. Add as many cells as you need, and comment wherever possible for readability. This homework should help you develop skills, understand the flow of an EDA, and get you ready for individual work.
Submission: Send in both an .ipynb and an .html file of your work.
Good luck!
Title: 1985 Auto Imports Database
Relevant Information: -- Description: This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk-factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky; -3 indicates that it is probably quite safe.
The third factor is the relative average loss payment per insured
vehicle year. This value is normalized for all autos within a
particular size classification (two-door small, station wagons,
sports/speciality, etc...), and represents the average loss per car
per year.
-- Note: Several of the attributes in the database could be used as a "class" attribute.
Number of Instances: 205
Number of Attributes: 26 total -- 15 continuous -- 1 integer -- 10 nominal
Attribute Information:
Missing Attribute Values: (denoted by "?")
from scipy import stats
from sklearn.linear_model import LinearRegression
from statsmodels.compat import lzip
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
%matplotlib inline
# Read in data
df2 = pd.read_csv('/content/auto_imp (1).csv')
df2
| | fuel_type | wheel_base | length | width | heights | curb_weight | engine_size | bore | stroke | comprassion | horse_power | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gas | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | gas | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | gas | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | gas | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | gas | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 190 | gas | 109.1 | 188.8 | 68.9 | 55.5 | 2952 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 16845 |
| 191 | gas | 109.1 | 188.8 | 68.8 | 55.5 | 3049 | 141 | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 |
| 192 | gas | 109.1 | 188.8 | 68.9 | 55.5 | 3012 | 173 | 3.58 | 2.87 | 8.8 | 134 | 5500 | 18 | 23 | 21485 |
| 193 | diesel | 109.1 | 188.8 | 68.9 | 55.5 | 3217 | 145 | 3.01 | 3.40 | 23.0 | 106 | 4800 | 26 | 27 | 22470 |
| 194 | gas | 109.1 | 188.8 | 68.9 | 55.5 | 3062 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 19 | 25 | 22625 |
195 rows × 15 columns
df2=pd.get_dummies(df2, columns=['fuel_type'],drop_first=True)
By using dummy variables, the categorical variable fuel_type is transformed into a numerical variable that can be used in regression analysis.
The fuel_type column in the DataFrame df2 is replaced by dummy-encoded columns by the call above. The drop_first=True option drops the first dummy column to avoid multicollinearity.
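As a small sketch (using made-up values, not the actual dataset), pd.get_dummies with drop_first=True keeps only one indicator column for a two-level category:

```python
import pandas as pd

# Tiny illustrative frame; the values are hypothetical
toy = pd.DataFrame({'fuel_type': ['gas', 'diesel', 'gas'],
                    'price': [13495, 22470, 16500]})
encoded = pd.get_dummies(toy, columns=['fuel_type'], drop_first=True)
# 'diesel' (the first category alphabetically) is dropped, leaving only
# the 'fuel_type_gas' indicator column
print(encoded.columns.tolist())
```

With only two fuel types, the single fuel_type_gas column carries all the information; diesel becomes the baseline, encoded as 0.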
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   wheel_base     195 non-null    float64
 1   length         195 non-null    float64
 2   width          195 non-null    float64
 3   heights        195 non-null    float64
 4   curb_weight    195 non-null    int64
 5   engine_size    195 non-null    int64
 6   bore           195 non-null    float64
 7   stroke         195 non-null    float64
 8   comprassion    195 non-null    float64
 9   horse_power    195 non-null    int64
 10  peak_rpm       195 non-null    int64
 11  city_mpg       195 non-null    int64
 12  highway_mpg    195 non-null    int64
 13  price          195 non-null    int64
 14  fuel_type_gas  195 non-null    uint8
dtypes: float64(7), int64(7), uint8(1)
memory usage: 21.6 KB
The auto imports dataset still has 195 rows and 15 columns after the dummy encoding, and, as can be seen, none of the columns contain null values. The columns span several data types: float64, int64, and uint8.
df2.head()
| | wheel_base | length | width | heights | curb_weight | engine_size | bore | stroke | comprassion | horse_power | peak_rpm | city_mpg | highway_mpg | price | fuel_type_gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 | 1 |
| 1 | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 | 1 |
| 2 | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 | 1 |
| 3 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 | 1 |
| 4 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 | 1 |
The first five observations of the auto imports dataset are returned.
Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
Follow the lecture notes for ideas of how to perform EDA on your dataset. For help, here are the steps we talked about:
Suggested Steps in EDA:
Provide descriptions of your sample and features
Check for missing data
Identify the shape of your data
Identify significant correlations
These steps are a guideline. Try different things and share your insights about the dataset (df2).
Don't forget to add "markdown" cells to include your findings or to explain what you are doing
## Your EDA should start here
df2.describe()
| | wheel_base | length | width | heights | curb_weight | engine_size | bore | stroke | comprassion | horse_power | peak_rpm | city_mpg | highway_mpg | price | fuel_type_gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 | 195.000000 |
| mean | 98.896410 | 174.256923 | 65.886154 | 53.861538 | 2559.000000 | 127.938462 | 3.329385 | 3.250308 | 10.194974 | 103.271795 | 5099.487179 | 25.374359 | 30.841026 | 13248.015385 | 0.897436 |
| std | 6.132038 | 12.476443 | 2.132484 | 2.396778 | 524.715799 | 41.433916 | 0.271866 | 0.314115 | 4.062109 | 37.869730 | 468.271381 | 6.401382 | 6.829315 | 8056.330093 | 0.304170 |
| min | 86.600000 | 141.100000 | 60.300000 | 47.800000 | 1488.000000 | 61.000000 | 2.540000 | 2.070000 | 7.000000 | 48.000000 | 4150.000000 | 13.000000 | 16.000000 | 5118.000000 | 0.000000 |
| 25% | 94.500000 | 166.300000 | 64.050000 | 52.000000 | 2145.000000 | 98.000000 | 3.150000 | 3.110000 | 8.500000 | 70.000000 | 4800.000000 | 19.500000 | 25.000000 | 7756.500000 | 1.000000 |
| 50% | 97.000000 | 173.200000 | 65.400000 | 54.100000 | 2414.000000 | 120.000000 | 3.310000 | 3.290000 | 9.000000 | 95.000000 | 5100.000000 | 25.000000 | 30.000000 | 10245.000000 | 1.000000 |
| 75% | 102.400000 | 184.050000 | 66.900000 | 55.650000 | 2943.500000 | 145.500000 | 3.590000 | 3.410000 | 9.400000 | 116.000000 | 5500.000000 | 30.000000 | 35.000000 | 16509.000000 | 1.000000 |
| max | 120.900000 | 208.100000 | 72.000000 | 59.800000 | 4066.000000 | 326.000000 | 3.940000 | 4.170000 | 23.000000 | 262.000000 | 6600.000000 | 49.000000 | 54.000000 | 45400.000000 | 1.000000 |
df2.describe() provides a statistical summary of the numerical variables in the auto imports dataset. For each numerical column it shows the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. The final column shows the summary statistics of the binary column fuel_type_gas.
df2.isnull().sum()
wheel_base       0
length           0
width            0
heights          0
curb_weight      0
engine_size      0
bore             0
stroke           0
comprassion      0
horse_power      0
peak_rpm         0
city_mpg         0
highway_mpg      0
price            0
fuel_type_gas    0
dtype: int64
The isnull().sum() call checks each column for missing values (NaN). It returns a Series whose index is the column names and whose values are the number of missing entries in each column.
The output shows that there are no missing values in any of the columns of the auto imports dataframe, since the count of missing values for each column is 0.
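Note that the original UCI file denotes missing values with "?" rather than empty cells, so a clean isnull() result depends on those markers being converted to NaN at load time. A minimal sketch (with made-up values) of how that conversion works:

```python
import io
import pandas as pd

# Inline CSV standing in for the raw file; the second 'bore' value is missing
csv_text = "bore,stroke\n3.47,2.68\n?,3.40\n"
df = pd.read_csv(io.StringIO(csv_text), na_values='?')
# With na_values='?', the '?' marker is read as NaN and isnull() can count it
print(df.isnull().sum())
```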
df2.shape
(195, 15)
df2.columns
Index(['wheel_base', 'length', 'width', 'heights', 'curb_weight',
'engine_size', 'bore', 'stroke', 'comprassion', 'horse_power',
'peak_rpm', 'city_mpg', 'highway_mpg', 'price', 'fuel_type_gas'],
dtype='object')
df2.shape returns a tuple of 195 rows and 15 columns, (195, 15), for the auto imports data set. This is a quick way to validate a DataFrame's dimensions instead of checking them manually. Keep in mind that shape is a read-only attribute; methods like df.drop() change a DataFrame's dimensions by returning a new DataFrame. And using the .columns attribute, all the variables present in the auto imports dataset are displayed.
sns.pairplot(data=df2, height=1.5)
<seaborn.axisgrid.PairGrid at 0x7f06b222c910>
The call above plots each pair of numerical variables in the auto imports dataset against each other. The height parameter specifies the height (in inches) of each facet. The result is a grid of pairwise scatter plots, with a histogram of each variable's distribution on the diagonal. The scatter plots show the relationships between each pair of variables.
fig, axes = plt.subplots(5, 3, figsize=(17, 17))
# (column, bins, color) for each histogram; None falls back to the default color
specs = [('wheel_base', 7, 'red'), ('length', 5, 'teal'), ('width', 6, 'green'),
         ('heights', 6, 'blue'), ('curb_weight', 6, 'purple'), ('engine_size', 6, 'orange'),
         ('bore', 7, 'red'), ('stroke', 5, 'teal'), ('comprassion', 6, 'green'),
         ('horse_power', 6, None), ('peak_rpm', 6, 'purple'), ('city_mpg', 6, None),
         ('highway_mpg', 6, 'orange'), ('price', 6, 'teal'), ('fuel_type_gas', 7, 'green')]
for ax, (col, bins, color) in zip(axes.flat, specs):
    ax.set_title(col)
    ax.hist(df2[col], bins=bins, color=color)
A 5x3 grid of histograms is created to display the distributions of the different features in the auto imports dataset.
The x-axis of each histogram shows the range of values for that specific attribute, while the y-axis shows how frequently each range of values occurs. For instance, the "length" variable's histogram reveals that the majority of the cars in the dataset are between 160 and 180 inches long, and the "fuel_type_gas" variable's histogram reveals that virtually all of the vehicles in the dataset run on gasoline.
In some of the variables, such as the "horse_power" variable, we can also see some outliers.
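One common way to make the visual impression of outliers concrete is the 1.5×IQR rule. A small sketch with hypothetical horsepower-like values:

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Made-up values roughly in the horse_power range, with one extreme entry
hp = pd.Series([70, 95, 110, 116, 120, 262])
print(iqr_outliers(hp).tolist())  # [262]
```

Applied to a real column, e.g. `iqr_outliers(df2['horse_power'])`, this lists the rows driving the long right tail seen in the histogram.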
df2.corr(method = 'pearson')
| | wheel_base | length | width | heights | curb_weight | engine_size | bore | stroke | comprassion | horse_power | peak_rpm | city_mpg | highway_mpg | price | fuel_type_gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wheel_base | 1.000000 | 0.879222 | 0.819009 | 0.592500 | 0.782720 | 0.569704 | 0.498228 | 0.171722 | 0.247730 | 0.375541 | -0.352331 | -0.499126 | -0.566355 | 0.585793 | -0.303643 |
| length | 0.879222 | 1.000000 | 0.858084 | 0.496218 | 0.881665 | 0.687479 | 0.609437 | 0.118664 | 0.160172 | 0.583813 | -0.280986 | -0.689660 | -0.719324 | 0.695331 | -0.210618 |
| width | 0.819009 | 0.858084 | 1.000000 | 0.315834 | 0.867315 | 0.740320 | 0.544311 | 0.186432 | 0.190997 | 0.616779 | -0.251627 | -0.647099 | -0.692220 | 0.754273 | -0.245375 |
| heights | 0.592500 | 0.496218 | 0.315834 | 1.000000 | 0.307732 | 0.031286 | 0.189283 | -0.055525 | 0.261160 | -0.084412 | -0.264078 | -0.102367 | -0.151188 | 0.138291 | -0.279070 |
| curb_weight | 0.782720 | 0.881665 | 0.867315 | 0.307732 | 1.000000 | 0.857573 | 0.645806 | 0.172785 | 0.155382 | 0.760285 | -0.278944 | -0.772171 | -0.812710 | 0.835729 | -0.219488 |
| engine_size | 0.569704 | 0.687479 | 0.740320 | 0.031286 | 0.857573 | 1.000000 | 0.583091 | 0.211989 | 0.024617 | 0.842691 | -0.219008 | -0.710624 | -0.732138 | 0.888942 | -0.063490 |
| bore | 0.498228 | 0.609437 | 0.544311 | 0.189283 | 0.645806 | 0.583091 | 1.000000 | -0.066793 | 0.003057 | 0.568527 | -0.277662 | -0.591950 | -0.600040 | 0.546873 | -0.056245 |
| stroke | 0.171722 | 0.118664 | 0.186432 | -0.055525 | 0.172785 | 0.211989 | -0.066793 | 1.000000 | 0.199882 | 0.100040 | -0.068300 | -0.027641 | -0.036453 | 0.093746 | -0.253774 |
| comprassion | 0.247730 | 0.160172 | 0.190997 | 0.261160 | 0.155382 | 0.024617 | 0.003057 | 0.199882 | 1.000000 | -0.214401 | -0.444582 | 0.331413 | 0.267941 | 0.069500 | -0.985398 |
| horse_power | 0.375541 | 0.583813 | 0.616779 | -0.084412 | 0.760285 | 0.842691 | 0.568527 | 0.100040 | -0.214401 | 1.000000 | 0.105654 | -0.834117 | -0.812917 | 0.811027 | 0.168454 |
| peak_rpm | -0.352331 | -0.280986 | -0.251627 | -0.264078 | -0.278944 | -0.219008 | -0.277662 | -0.068300 | -0.444582 | 0.105654 | 1.000000 | -0.069493 | -0.016950 | -0.104333 | 0.480952 |
| city_mpg | -0.499126 | -0.689660 | -0.647099 | -0.102367 | -0.772171 | -0.710624 | -0.591950 | -0.027641 | 0.331413 | -0.834117 | -0.069493 | 1.000000 | 0.972350 | -0.702685 | -0.260796 |
| highway_mpg | -0.566355 | -0.719324 | -0.692220 | -0.151188 | -0.812710 | -0.732138 | -0.600040 | -0.036453 | 0.267941 | -0.812917 | -0.016950 | 0.972350 | 1.000000 | -0.715590 | -0.193998 |
| price | 0.585793 | 0.695331 | 0.754273 | 0.138291 | 0.835729 | 0.888942 | 0.546873 | 0.093746 | 0.069500 | 0.811027 | -0.104333 | -0.702685 | -0.715590 | 1.000000 | -0.108968 |
| fuel_type_gas | -0.303643 | -0.210618 | -0.245375 | -0.279070 | -0.219488 | -0.063490 | -0.056245 | -0.253774 | -0.985398 | 0.168454 | 0.480952 | -0.260796 | -0.193998 | -0.108968 | 1.000000 |
Certain observations can be made from the above correlation table:
--> Strong positive correlation between the following variables:
length and wheel base
length and curb weight
width and curb weight
engine size and horsepower
city mpg and highway mpg
price and engine size
price and horsepower
price and curb weight
--> Strong negative correlation between the following variables:
city mpg and horsepower
highway mpg and horsepower
compression ratio (comprassion) and fuel_type_gas
--> Moderate positive correlation between the following variables:
wheel base and engine size
wheel base and curb weight
length and engine size
length and curb weight
width and engine size
width and horsepower
height and compression ratio
--> Moderate negative correlation between the following variables:
wheel base and city mpg
wheel base and highway mpg
length and city mpg
length and highway mpg
width and city mpg
width and highway mpg
height and highway mpg
plt.figure(figsize=(10, 8))
sns.heatmap(df2.corr(method='pearson'), annot=True)
plt.show()
The Pearson correlation coefficient calculates the linear association between two variables and ranges from -1 (totally negative correlation) to 1 (absolutely positive correlation).
The heatmap shows the strength and direction of the connections between the variables in the auto imports dataset. The color coding of the cells indicates the strength of the relationship, with deeper colors signifying stronger connections. Positive numbers in the cells imply positive correlations, whereas negative values in the cells reflect negative correlations.
The auto imports dataset's engine_size and price columns have a high positive correlation, which suggests that more expensive cars typically have larger engines. More powerful cars are also more expensive, as shown by the strong positive correlation between horse_power and price. On the other hand, city_mpg and highway_mpg are strongly negatively correlated with price and horsepower, indicating that more fuel-efficient cars tend to be less powerful and less expensive.
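Rather than reading every cell of the heatmap, strongly correlated pairs can also be pulled out programmatically. A sketch on a small synthetic frame (for the real data one would pass df2.corr() instead):

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for df2: 'a' and 'b' are built to be correlated
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({'a': x,
                   'b': 2 * x + rng.normal(scale=0.1, size=200),
                   'c': rng.normal(size=200)})
corr = df.corr()
# Keep only the upper triangle so each pair appears once, then filter by |r|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.8]
print(strong.index.tolist())  # [('a', 'b')]
```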
1. Create a model that uses all the variables and call it model1. The dependent variable is price; the independent variables are all the rest. Print out a summary of the model (coefficients, standard errors, confidence intervals, and other metrics shown in class) and answer the questions based on your output.
##Your code goes here
y = df2['price']
X = df2.drop('price', axis=1)
X = sm.add_constant(X) #constant created to the independent variables
model1 = sm.OLS(y, X).fit()
model1
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f06a6b41570>
The code above fits a multiple linear regression, where y holds the response (dependent) variable and X holds the predictor (independent) variables. The model estimates the linear relationship between the dependent variable and the independent variables, and the added constant serves as the regression line's intercept.
model1.summary()
| Dep. Variable: | price | R-squared: | 0.860 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.849 |
| Method: | Least Squares | F-statistic: | 78.89 |
| Date: | Sun, 30 Apr 2023 | Prob (F-statistic): | 5.84e-69 |
| Time: | 04:16:57 | Log-Likelihood: | -1838.5 |
| No. Observations: | 195 | AIC: | 3707. |
| Df Residuals: | 180 | BIC: | 3756. |
| Df Model: | 14 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -4.45e+04 | 1.84e+04 | -2.419 | 0.017 | -8.08e+04 | -8194.301 |
| wheel_base | 39.5305 | 103.549 | 0.382 | 0.703 | -164.796 | 243.857 |
| length | -60.6333 | 58.500 | -1.036 | 0.301 | -176.068 | 54.801 |
| width | 603.6414 | 254.539 | 2.372 | 0.019 | 101.377 | 1105.906 |
| heights | 329.5669 | 140.947 | 2.338 | 0.020 | 51.446 | 607.688 |
| curb_weight | 1.1798 | 1.738 | 0.679 | 0.498 | -2.249 | 4.609 |
| engine_size | 138.4537 | 16.111 | 8.594 | 0.000 | 106.662 | 170.245 |
| bore | -1208.4137 | 1206.683 | -1.001 | 0.318 | -3589.479 | 1172.651 |
| stroke | -3706.0531 | 874.513 | -4.238 | 0.000 | -5431.669 | -1980.437 |
| comprassion | -617.1497 | 446.452 | -1.382 | 0.169 | -1498.103 | 263.804 |
| horse_power | 34.6328 | 18.049 | 1.919 | 0.057 | -0.982 | 70.248 |
| peak_rpm | 2.5517 | 0.709 | 3.599 | 0.000 | 1.153 | 3.951 |
| city_mpg | -288.2868 | 180.791 | -1.595 | 0.113 | -645.030 | 68.456 |
| highway_mpg | 316.6334 | 163.540 | 1.936 | 0.054 | -6.069 | 639.336 |
| fuel_type_gas | -1.173e+04 | 6002.268 | -1.955 | 0.052 | -2.36e+04 | 110.854 |
| Omnibus: | 18.136 | Durbin-Watson: | 0.978 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 55.211 |
| Skew: | 0.240 | Prob(JB): | 1.03e-12 |
| Kurtosis: | 5.562 | Cond. No. | 4.77e+05 |
The model achieved an R-squared value of 0.860, indicating that approximately 86% of the variance in the dependent variable is explained by the independent variables. The adjusted R-squared value is 0.849. The model is statistically significant, according to the F-statistic of 78.89, which has a probability of 5.84e-69.
Additionally displayed in the output are the coefficients for each independent variable. When all other independent variables are held constant, the coefficient estimates show the change in the dependent variable for a one-unit change in the related independent variable. The p-values, t-statistics, and standard errors are also displayed.
#variance of the model
model1.mse_resid
9802534.782392945
#95% confidence interval
model1.conf_int(alpha=0.05, cols=None)
| | 0 | 1 |
|---|---|---|
| const | -80804.687242 | -8194.301393 |
| wheel_base | -164.796403 | 243.857427 |
| length | -176.067817 | 54.801253 |
| width | 101.376931 | 1105.905788 |
| heights | 51.445805 | 607.688013 |
| curb_weight | -2.248863 | 4.608553 |
| engine_size | 106.662153 | 170.245178 |
| bore | -3589.478552 | 1172.651235 |
| stroke | -5431.669228 | -1980.436936 |
| comprassion | -1498.103020 | 263.803582 |
| horse_power | -0.982202 | 70.247851 |
| peak_rpm | 1.152653 | 3.950693 |
| city_mpg | -645.029989 | 68.456450 |
| highway_mpg | -6.069257 | 639.336115 |
| fuel_type_gas | -23576.867004 | 110.854470 |
# 99% confidence interval
model1.conf_int(alpha=0.01, cols=None)
| | 0 | 1 |
|---|---|---|
| const | -92399.415577 | 3400.426942 |
| wheel_base | -230.051946 | 309.112970 |
| length | -212.933950 | 91.667387 |
| width | -59.030413 | 1266.313132 |
| heights | -37.377264 | 696.511082 |
| curb_weight | -3.343883 | 5.703574 |
| engine_size | 96.508951 | 180.398379 |
| bore | -4349.915238 | 1933.087921 |
| stroke | -5982.776351 | -1429.329813 |
| comprassion | -1779.451593 | 545.152154 |
| horse_power | -12.356513 | 81.622162 |
| peak_rpm | 0.705850 | 4.397495 |
| city_mpg | -758.962470 | 182.388931 |
| highway_mpg | -109.130271 | 742.397128 |
| fuel_type_gas | -27359.420865 | 3893.408332 |
2. The following variables are statistically significant: const, width, heights, engine_size, stroke, peak_rpm. Six coefficients are therefore statistically significant at the 5% level.
3. The variance of the model, also known as the mean squared error (MSE) of the residuals, can be found in the OLS Regression Results as mse_resid. In this case, the value of mse_resid is 9802534.78.
4. The coefficient of determination, also known as R-squared (R²), is a statistical measure that represents the proportion of the variance in the dependent variable (price) that can be explained by the independent variables (wheel_base, length, width, heights, curb_weight, engine_size, bore, stroke, comprassion, horse_power, peak_rpm, city_mpg, highway_mpg, fuel_type_gas) in the linear regression model.
In this case, the R-squared value is 0.860, which means that approximately 86% of the variation in car prices can be explained by the variables included in the model. This indicates that the model fits the data well and that the selected independent variables have a strong relationship with the dependent variable. However, there may be other variables not included in the model that also affect car prices, and the model's performance should be further evaluated with other statistical measures such as the F-statistic, t-values, and p-values.
5. The F-statistic evaluates the overall significance of the regression model. Here the F-statistic is 78.89, and its extremely low p-value of 5.84e-69 indicates that the model as a whole is significant and accounts for a substantial share of the variation in the dependent variable.
The F-statistic tests the null hypothesis that all regression coefficients are equal to zero, i.e. that there is no relationship between the independent variables and the dependent variable. The alternative hypothesis is that at least one regression coefficient is not equal to zero, i.e. that some relationship exists.
Because the F-statistic is large and the p-value is very small, we reject the null hypothesis in favor of the alternative. This suggests that the independent variables are related to the dependent variable and that the model as a whole is effective at predicting it.
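As a sanity check, the F-statistic can be recomputed from the reported R-squared via F = (R² / k) / ((1 − R²) / (n − k − 1)), with n = 195 observations and k = 14 predictors:

```python
# Values taken from the model1 summary above
r2, n, k = 0.860, 195, 14
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(f_stat, 1))  # about 79.0, matching the reported 78.89 up to rounding of R²
```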
2. Drop all the variables that are not statistically significant at least at the 90% confidence level. Run another regression model with price as the dependent variable and the rest of the variables as the independent variables. Call it model2. Print a summary of the results and answer the questions below.
## your code starts here
# Collect the names of the variables not significant at the 90% level (p-value > 0.1);
# guard against dropping 'const', which is not a column of df2
pvals = model1.pvalues
drop_not_significant = pvals[pvals > 0.1].index.drop('const', errors='ignore').tolist()
df3 = df2.drop(drop_not_significant, axis=1)
df3
| | width | heights | engine_size | stroke | horse_power | peak_rpm | highway_mpg | price | fuel_type_gas |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 64.1 | 48.8 | 130 | 2.68 | 111 | 5000 | 27 | 13495 | 1 |
| 1 | 64.1 | 48.8 | 130 | 2.68 | 111 | 5000 | 27 | 16500 | 1 |
| 2 | 65.5 | 52.4 | 152 | 3.47 | 154 | 5000 | 26 | 16500 | 1 |
| 3 | 66.2 | 54.3 | 109 | 3.40 | 102 | 5500 | 30 | 13950 | 1 |
| 4 | 66.4 | 54.3 | 136 | 3.40 | 115 | 5500 | 22 | 17450 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 190 | 68.9 | 55.5 | 141 | 3.15 | 114 | 5400 | 28 | 16845 | 1 |
| 191 | 68.8 | 55.5 | 141 | 3.15 | 160 | 5300 | 25 | 19045 | 1 |
| 192 | 68.9 | 55.5 | 173 | 2.87 | 134 | 5500 | 23 | 21485 | 1 |
| 193 | 68.9 | 55.5 | 145 | 3.40 | 106 | 4800 | 27 | 22470 | 0 |
| 194 | 68.9 | 55.5 | 141 | 3.15 | 114 | 5400 | 25 | 22625 | 1 |
195 rows × 9 columns
#Running another regression model(model 2) with price as the dependent variable and the rest of the variables as the independent variables
X2 = df3.drop('price', axis=1)
y2 = df3['price']
X2 = sm.add_constant(X2)
model2 = sm.OLS(y2, X2).fit()
model2
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f06b2161900>
model2.summary()
| Dep. Variable: | price | R-squared: | 0.854 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.848 |
| Method: | Least Squares | F-statistic: | 136.5 |
| Date: | Sun, 30 Apr 2023 | Prob (F-statistic): | 1.29e-73 |
| Time: | 04:16:57 | Log-Likelihood: | -1842.2 |
| No. Observations: | 195 | AIC: | 3702. |
| Df Residuals: | 186 | BIC: | 3732. |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -6.156e+04 | 1.49e+04 | -4.138 | 0.000 | -9.09e+04 | -3.22e+04 |
| width | 566.4440 | 196.395 | 2.884 | 0.004 | 178.995 | 953.893 |
| heights | 289.5923 | 113.100 | 2.561 | 0.011 | 66.470 | 512.715 |
| engine_size | 131.2899 | 13.724 | 9.566 | 0.000 | 104.215 | 158.365 |
| stroke | -2942.0083 | 775.472 | -3.794 | 0.000 | -4471.859 | -1412.158 |
| horse_power | 43.1639 | 15.720 | 2.746 | 0.007 | 12.152 | 74.176 |
| peak_rpm | 2.3532 | 0.639 | 3.683 | 0.000 | 1.093 | 3.614 |
| highway_mpg | 39.9588 | 69.437 | 0.575 | 0.566 | -97.027 | 176.944 |
| fuel_type_gas | -3384.0156 | 998.300 | -3.390 | 0.001 | -5353.463 | -1414.568 |
| Omnibus: | 18.233 | Durbin-Watson: | 0.944 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 61.116 |
| Skew: | 0.177 | Prob(JB): | 5.36e-14 |
| Kurtosis: | 5.720 | Cond. No. | 3.39e+05 |
#variance of the model
model2.mse_resid
9852556.110539043
1. The intercept (const) represents the estimated mean value of the dependent variable (price) when all the independent variables are equal to zero. In other words, it is the predicted price for a car with zero width, zero height, zero engine size, zero stroke, zero horsepower, zero peak rpm, and zero highway mpg, regardless of its fuel type. Since this is not a meaningful or realistic scenario, the intercept should be interpreted with caution; in practice the focus should be on interpreting the coefficients of the significant independent variables.
2. In the second regression model (model2) there are 8 independent variables. All of them are statistically significant at least at the 90% confidence level (p-value less than 0.1) except highway_mpg, whose p-value of 0.566 indicates that it is no longer significant in the reduced model.
3. The variance of the model can be estimated using the mean squared error (MSE) of the residuals, which is provided by the mse_resid attribute of the model2 object. The variance of the model is approximately 9,852,556.11.
4. The proportion of the variance in the dependent variable (price) that can be explained by the independent variables (width, heights, engine_size, stroke, horse_power, peak_rpm, highway_mpg, and fuel_type_gas) is expressed by the coefficient of determination (R-squared). Model2's R-squared value is 0.854, which indicates that the independent variables in the model account for about 85.4% of the price variation.
The adjusted R-squared is a modified version of R-squared that penalizes the number of independent variables in the model. The adjusted R-squared for model2 is 0.848, essentially the same as model1's 0.849, even though model2 uses six fewer predictors. This suggests that the dropped variables added little explanatory power, and model2 is preferable as the more parsimonious model with nearly identical fit.
5. The F-statistic is a measure of the overall significance of the regression model. It tests the null hypothesis that all the regression coefficients in the model are equal to zero, indicating that the independent variables are not good predictors of the dependent variable. If the F-statistic is large and the associated p-value is small (typically less than 0.05), we reject the null hypothesis and conclude that the regression model is statistically significant.
The F-statistic for this model is 136.5, with a very low p-value of 1.29e-73. We can therefore reject the null hypothesis that all of the regression coefficients are equal to zero and conclude that the regression model is highly significant: the independent variables are reliable predictors of the dependent variable, price.
3. Compare the two models with ANOVA. What are your null and alternative hypothesis? What is your conclusion?
H0: The reduced model (model2) is adequate
Ha: The full model (model1) is better
OR
H0: the coefficients of all the additional variables are equal to 0
Ha: at least one of those coefficients is nonzero and has explanatory power
##your code goes here
anova_lm(model2,model1)
| | df_resid | ssr | df_diff | ss_diff | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 0 | 186.0 | 1.832575e+09 | 0.0 | NaN | NaN | NaN |
| 1 | 180.0 | 1.764456e+09 | 6.0 | 6.811918e+07 | 1.15819 | 0.3308 |
The null hypothesis is that the reduced model (model2) is adequate, while the alternative hypothesis is that the full model (model1) explains significantly more variation. The ANOVA test compares the sum of squared residuals (SSR) of the two models and calculates an F-statistic and p-value. Here the F-statistic is 1.158 and the p-value is 0.3308. Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis: there is insufficient evidence that the six variables dropped from model1 significantly improve the fit. Therefore, the reduced model (model2) is preferred for this data.
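The same nested-model comparison can be sketched on synthetic data, where x2 has no true effect and the fuller model merely adds it (all names and values here are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic data: y depends on x1 only; x2 is pure noise
rng = np.random.default_rng(2)
d = pd.DataFrame({'x1': rng.normal(size=150), 'x2': rng.normal(size=150)})
d['y'] = 1.0 + 2.0 * d['x1'] + rng.normal(size=150)
reduced = smf.ols('y ~ x1', data=d).fit()
full = smf.ols('y ~ x1 + x2', data=d).fit()
# Pass the reduced model first; the table reports the F-test on the extra term
table = anova_lm(reduced, full)
print(table[['df_resid', 'ssr', 'F', 'Pr(>F)']])
```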
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'wheel_base', data = df2)
<Axes: xlabel='wheel_base', ylabel='price'>
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'length', data = df2)
<Axes: xlabel='length', ylabel='price'>
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'horse_power', data = df2)
<Axes: xlabel='horse_power', ylabel='price'>
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'city_mpg', data = df2)
<Axes: xlabel='city_mpg', ylabel='price'>
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'curb_weight', data = df2)
<Axes: xlabel='curb_weight', ylabel='price'>
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'engine_size', data = df2)
<Axes: xlabel='engine_size', ylabel='price'>
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'bore', data = df2)
<Axes: xlabel='bore', ylabel='price'>
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'stroke', data = df2)
<Axes: xlabel='stroke', ylabel='price'>
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'comprassion', data = df2)
<Axes: xlabel='comprassion', ylabel='price'>
sns.set(rc = {'figure.figsize':(10,5)})
sns.residplot(y = 'price',x = 'highway_mpg', data = df2)
<Axes: xlabel='highway_mpg', ylabel='price'>
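The repeated cells above could also be written as a loop over the predictor columns. Under the hood, `sns.residplot` fits a simple regression of y on x and plots its residuals against x; a minimal numpy sketch with synthetic stand-in data (the numbers here are illustrative, not taken from df2) shows the defining property of those residuals:

```python
import numpy as np

# Synthetic stand-ins: in the notebook, x would be df2[col] and price would be df2['price']
rng = np.random.default_rng(1)
x = rng.uniform(60, 120, 200)                    # a 'length'-like predictor (illustrative)
price = 150 * x + rng.normal(0, 2000, size=200)

# residplot draws the residuals of the simple regression price ~ x against x:
slope, intercept = np.polyfit(x, price, 1)
resid = price - (slope * x + intercept)

# OLS residuals (with an intercept) average to ~0 and are uncorrelated with x;
# what matters in the plot is any remaining *pattern* (curvature, fanning out).
print(abs(resid.mean()) < 1e-6, abs(np.corrcoef(x, resid)[0, 1]) < 1e-6)  # True True
```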
The cells above generate a residual plot of the target variable "price" against each independent variable. A residual plot is a graphical diagnostic of regression fit: it displays the residuals, the differences between the actual and predicted values.
Each plot shows the residuals against the corresponding independent variable. If the points scatter randomly around zero with no obvious pattern or trend, the linear regression assumptions are plausible and the model fits reasonably. A distinct pattern or trend, on the other hand, indicates the model does not adequately fit the data and points to problems with the assumptions, such as nonlinearity, heteroscedasticity, or outliers.
Based on the plots, a linear fit looks adequate for some predictors but questionable for others. For example, the residual plot for "horse_power" shows a clear curved pattern, suggesting the relationship between "horse_power" and "price" is not purely linear. In contrast, the plot for "curb_weight" scatters fairly randomly with no clear trend, consistent with a linear relationship. Overall, these plots help identify potential issues with the linear regression model and guide the choice of predictors (or transformations) to include.
4. Checking the assumptions on the model you chose based on ANOVA:
-What are the assumptions?
-Do they hold? How do you test/check?
The assumptions for linear regression are:
Linearity: The relationship between the independent variables and the dependent variable should be linear.
Independence of Errors: The errors or residuals (the difference between the actual and predicted values) should be independent of each other.
Homoscedasticity: The variance of the residuals should be constant across all levels of the independent variables.
Normality: The residuals should be normally distributed.
No multicollinearity: There should be no high correlation between independent variables.
No influential points: No outliers or influential points should be present in the data.
# Residual diagnostics: independence (Durbin-Watson) and normality (Jarque-Bera, Q-Q plot)
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from scipy import stats
from statsmodels.compat import lzip
import statsmodels
statsmodels.stats.stattools.durbin_watson(model1.resid)
0.9784765574137814
The Durbin-Watson statistic measures autocorrelation in the residuals. It ranges from 0 to 4; a value near 2 indicates no autocorrelation, values well below 2 indicate positive autocorrelation, and values well above 2 indicate negative autocorrelation.
Here the Durbin-Watson statistic for model1 is about 0.98, which is well below 2. This suggests positive autocorrelation in the residuals, so the independence-of-errors assumption is questionable for this model. (Note that Durbin-Watson tests independence, not normality; normality is checked separately below.)
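The Durbin-Watson statistic is simple to compute directly from the residual series. A small self-contained sketch (using made-up residual sequences, not the model's) illustrates how trending residuals push the statistic toward 0 and alternating residuals push it toward 4:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); a value near 2 means no autocorrelation."""
    e = np.asarray(resid, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

print(round(durbin_watson([1., 2., 3.]), 4))        # 0.1429: trending residuals, positive autocorrelation
print(round(durbin_watson([1., -1., 1., -1.]), 1))  # 3.0: alternating residuals, negative autocorrelation
```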
name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']
test = sms.jarque_bera(model1.resid)
lzip(name, test)
[('Jarque-Bera', 55.21077700280737),
('Chi^2 two-tail prob.', 1.025963954971646e-12),
('Skew', 0.24046525207000358),
('Kurtosis', 5.562006714350943)]
The Jarque-Bera statistic is 55.21 and the chi-squared two-tail probability is extremely low (1.03e-12), giving strong evidence against the null hypothesis that the residuals are normally distributed. The skewness of 0.24 indicates a nearly symmetric distribution, but the kurtosis of 5.56 is well above 3, indicating much heavier tails than a normal distribution. The normality assumption is therefore violated.
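As a sanity check, the reported JB statistic can be reconstructed from the printed skew and kurtosis using JB = n/6 * (S^2 + (K - 3)^2 / 4), assuming n = 195 observations as in the model summaries:

```python
# Skew and kurtosis exactly as printed by sms.jarque_bera above; n from the model summary
n, skew, kurt = 195, 0.24046525207000358, 5.562006714350943

# Jarque-Bera: JB = n/6 * (S^2 + (K - 3)^2 / 4), where K is (non-excess) kurtosis
jb = n / 6 * (skew**2 + (kurt - 3) ** 2 / 4)
print(round(jb, 4))  # 55.2108, matching the reported Jarque-Bera statistic
```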
#Q-Q Plot for testing Normality
stats.probplot(model1.resid, dist="norm", plot= plt)
plt.title("Model1 Residuals Q-Q Plot")
#Saving plot as a png
plt.savefig("Model1_Resid_qqplot.png")
Based on the output, it appears that the residuals of model1 may not be normally distributed. The Jarque-Bera test has a low p-value, indicating that the residuals are not normally distributed. The Q-Q plot also shows some deviation from the diagonal line, further suggesting non-normality.
#Constant variance test
name = ['Lagrange multiplier statistic', 'p-value',
'f-value', 'f p-value']
test = sms.het_breuschpagan(model1.resid, model1.model.exog)
lzip(name, test)
[('Lagrange multiplier statistic', 77.88393057396982),
('p-value', 6.971426693124656e-11),
('f-value', 8.550191502949835),
('f p-value', 4.925064760404709e-14)]
The results of the het_breuschpagan test show that heteroscedasticity is present in the residuals at a significance level of 0.05, rejecting the null hypothesis of homoscedasticity.
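The Breusch-Pagan LM statistic is just n times the R^2 of an auxiliary regression of the squared residuals on the explanatory variables. A self-contained sketch on synthetic heteroscedastic data (all names and numbers here are illustrative, not from the notebook) shows the mechanics:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(1, 5, n)
y = 2 + 3 * x + rng.normal(scale=x)  # error spread grows with x: heteroscedastic by construction

# OLS fit of y ~ x
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Breusch-Pagan LM statistic: n * R^2 from regressing the squared residuals on X
u2 = resid**2
g, *_ = np.linalg.lstsq(X, u2, rcond=None)
r2 = 1 - np.sum((u2 - X @ g) ** 2) / np.sum((u2 - u2.mean()) ** 2)
lm = n * r2
print(lm > 3.84)  # True: exceeds the chi2(1) 5% critical value, heteroscedasticity detected
```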
# fitted values
model_fitted_y = model1.fittedvalues
# Plot
plot = sns.residplot(x=model_fitted_y, y='price', data=df2, lowess=True,
scatter_kws={'alpha': 0.5},
line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
# Title and labels
plot.set_title('Residuals vs Fitted')
plot.set_xlabel('Fitted values')
plot.set_ylabel('Residuals');
The plot shows the relationship between the fitted values and the residuals. The red line in the middle represents the lowess smooth curve, which helps to detect any patterns in the residuals. Ideally, we want to see a random scatter of points with no apparent pattern, which would indicate that the model's assumptions are met.
In this case, the plot seems to show some heteroscedasticity (non-constant variance) in the residuals. This means that the variance of the residuals is not the same across all fitted values, which violates one of the assumptions of linear regression. This is also supported by the result of the Breusch-Pagan test for heteroscedasticity, which showed a low p-value, indicating rejection of the null hypothesis of constant variance.
fig = sm.graphics.plot_partregress_grid(model1)
fig.tight_layout(pad=1.5)
Each panel in this partial regression grid depicts the relationship between the dependent variable and one independent variable after controlling for all of the other independent variables. If the linearity assumption holds, each panel should show a roughly linear trend.
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(model1, 'horse_power', fig=fig)
plt.show()
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(model1, 'engine_size', fig=fig)
plt.show()
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(model1, 'comprassion', fig=fig)
plt.show()
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(model1, 'highway_mpg', fig=fig)
plt.show()
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(model1, 'fuel_type_gas', fig=fig)
plt.show()
Each `plot_regress_exog` call generates a 2x2 grid of diagnostic plots for one predictor against the dependent variable (price).
The top-left panel plots price and the fitted values against the predictor. The top-right panel plots the residuals against the predictor, to check for remaining structure. The bottom-left panel is the partial regression plot, which shows the predictor's relationship with price after accounting for the other variables in the model. The bottom-right panel is the CCPR (component plus residual) plot, which helps judge whether the predictor's effect is linear.
These figures are used to assess the assumptions of the linear regression model and search for any potential issues such as non-linearity, non-normal residuals, and outliers.
plt.rc("figure", figsize=(15, 8))
plt.rc("font", size=14)
fig = sm.graphics.influence_plot(model1, criterion="cooks")
fig.tight_layout(pad=1.8)
The influence plot displays studentized residuals (y-axis) against leverage (x-axis), with each point's bubble size proportional to its Cook's distance. Cook's distance measures how much the fitted regression coefficients would change if that observation were removed, so large bubbles mark observations that strongly influence the regression.
Each point represents one observation; the most concerning ones combine high leverage with a large residual. A common rule of thumb flags observations with Cook's distance above 0.5 (or above 1) for closer inspection.
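Cook's distance can be computed directly from the residuals and the hat-matrix leverages: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2. A tiny synthetic example (made-up data, not from df2) shows how a high-leverage point dominates:

```python
import numpy as np

# Made-up toy data: four ordinary points plus one far-out x value
x = np.array([1., 2., 3., 4., 10.])
y = np.array([1.2, 1.9, 3.2, 4.1, 12.0])

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
p = X.shape[1]                         # number of estimated coefficients
s2 = resid @ resid / (len(y) - p)      # residual variance estimate

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix; its diagonal gives the leverages
h = np.diag(H)

# Cook's distance: D_i = e_i^2 / (p * s2) * h_i / (1 - h_i)^2
cooks = resid**2 / (p * s2) * h / (1 - h) ** 2
print(int(cooks.argmax()))  # 4: the high-leverage last point is by far the most influential
```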
5. Is there Multicollinearity in your data?
Calculate VIF for both the full model and the reduce model. What do you notice?
Variance inflation factor (VIF): flag predictors with VIF > max(10, 1/(1 - R^2)).
Full model: 1/(1 - 0.860) = 7.14; reduced model: 1/(1 - 0.854) = 6.85. In both cases max(10, 1/(1 - R^2)) = 10, so 10 is the working cutoff.
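The cutoff arithmetic can be checked in a couple of lines (the R^2 values 0.860 and 0.854 are taken from the two model summaries):

```python
# Rule of thumb: flag VIF_j above max(10, 1/(1 - R^2)) of the model
for name, r2 in [("full", 0.860), ("reduced", 0.854)]:
    cutoff = max(10, 1 / (1 - r2))
    print(name, round(1 / (1 - r2), 2), "cutoff:", cutoff)  # 7.14 and 6.85; cutoff 10 both times
```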
# Change array to dataframe. For each X, calculate VIF and save in dataframe. Anything above 10 will suggest multicollinearity
# if you did not change your data into matrix format, you may not need to make any changes.
from statsmodels.stats.outliers_influence import variance_inflation_factor  # needed if not imported earlier
X_df2 = pd.DataFrame(X)
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_df2.values, i) for i in range(X_df2.shape[1])]
vif["Feature"] = X_df2.columns
vif.round(2)
| | VIF Factor | Feature |
|---|---|---|
| 0 | 6734.07 | const |
| 1 | 7.98 | wheel_base |
| 2 | 10.54 | length |
| 3 | 5.83 | width |
| 4 | 2.26 | heights |
| 5 | 16.45 | curb_weight |
| 6 | 8.82 | engine_size |
| 7 | 2.13 | bore |
| 8 | 1.49 | stroke |
| 9 | 65.09 | comprassion |
| 10 | 9.25 | horse_power |
| 11 | 2.18 | peak_rpm |
| 12 | 26.51 | city_mpg |
| 13 | 24.69 | highway_mpg |
| 14 | 65.97 | fuel_type_gas |
Based on the VIF values, several of the independent variables are highly multicollinear with one another. The features with VIF above 10, signalling high multicollinearity, are length (10.54), curb_weight (16.45), comprassion (65.09), city_mpg (26.51), highway_mpg (24.69), and fuel_type_gas (65.97). (The huge VIF for the constant is expected and can be ignored.) The inclusion of these features in the model should be reassessed, since they are largely redundant with, or highly correlated with, other features.
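A VIF is nothing more than 1/(1 - R_j^2) from regressing predictor j on the remaining predictors. A self-contained sketch on synthetic data (the near-duplicate pair here mimics a relationship like city_mpg vs highway_mpg) makes the connection explicit:

```python
import numpy as np

def vif_via_r2(j, X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

# Synthetic predictors: b is a near-duplicate of a; c is unrelated to both
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 0.97 * a + 0.25 * rng.normal(size=1000)
c = rng.normal(size=1000)
X = np.column_stack([a, b, c])

print(vif_via_r2(0, X) > 10, vif_via_r2(2, X) < 2)  # True True
```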
X_df = pd.DataFrame(X)
corr_matrix = X_df.corr()
print(corr_matrix)
const wheel_base length width heights curb_weight \
const NaN NaN NaN NaN NaN NaN
wheel_base NaN 1.000000 0.879222 0.819009 0.592500 0.782720
length NaN 0.879222 1.000000 0.858084 0.496218 0.881665
width NaN 0.819009 0.858084 1.000000 0.315834 0.867315
heights NaN 0.592500 0.496218 0.315834 1.000000 0.307732
curb_weight NaN 0.782720 0.881665 0.867315 0.307732 1.000000
engine_size NaN 0.569704 0.687479 0.740320 0.031286 0.857573
bore NaN 0.498228 0.609437 0.544311 0.189283 0.645806
stroke NaN 0.171722 0.118664 0.186432 -0.055525 0.172785
comprassion NaN 0.247730 0.160172 0.190997 0.261160 0.155382
horse_power NaN 0.375541 0.583813 0.616779 -0.084412 0.760285
peak_rpm NaN -0.352331 -0.280986 -0.251627 -0.264078 -0.278944
city_mpg NaN -0.499126 -0.689660 -0.647099 -0.102367 -0.772171
highway_mpg NaN -0.566355 -0.719324 -0.692220 -0.151188 -0.812710
fuel_type_gas NaN -0.303643 -0.210618 -0.245375 -0.279070 -0.219488
engine_size bore stroke comprassion horse_power \
const NaN NaN NaN NaN NaN
wheel_base 0.569704 0.498228 0.171722 0.247730 0.375541
length 0.687479 0.609437 0.118664 0.160172 0.583813
width 0.740320 0.544311 0.186432 0.190997 0.616779
heights 0.031286 0.189283 -0.055525 0.261160 -0.084412
curb_weight 0.857573 0.645806 0.172785 0.155382 0.760285
engine_size 1.000000 0.583091 0.211989 0.024617 0.842691
bore 0.583091 1.000000 -0.066793 0.003057 0.568527
stroke 0.211989 -0.066793 1.000000 0.199882 0.100040
comprassion 0.024617 0.003057 0.199882 1.000000 -0.214401
horse_power 0.842691 0.568527 0.100040 -0.214401 1.000000
peak_rpm -0.219008 -0.277662 -0.068300 -0.444582 0.105654
city_mpg -0.710624 -0.591950 -0.027641 0.331413 -0.834117
highway_mpg -0.732138 -0.600040 -0.036453 0.267941 -0.812917
fuel_type_gas -0.063490 -0.056245 -0.253774 -0.985398 0.168454
peak_rpm city_mpg highway_mpg fuel_type_gas
const NaN NaN NaN NaN
wheel_base -0.352331 -0.499126 -0.566355 -0.303643
length -0.280986 -0.689660 -0.719324 -0.210618
width -0.251627 -0.647099 -0.692220 -0.245375
heights -0.264078 -0.102367 -0.151188 -0.279070
curb_weight -0.278944 -0.772171 -0.812710 -0.219488
engine_size -0.219008 -0.710624 -0.732138 -0.063490
bore -0.277662 -0.591950 -0.600040 -0.056245
stroke -0.068300 -0.027641 -0.036453 -0.253774
comprassion -0.444582 0.331413 0.267941 -0.985398
horse_power 0.105654 -0.834117 -0.812917 0.168454
peak_rpm 1.000000 -0.069493 -0.016950 0.480952
city_mpg -0.069493 1.000000 0.972350 -0.260796
highway_mpg -0.016950 0.972350 1.000000 -0.193998
fuel_type_gas 0.480952 -0.260796 -0.193998 1.000000
The correlation matrix shows the correlation coefficients of the explanatory variables for the dataset. A linear relationship between two variables is evaluated by the correlation coefficient, which has a range of -1 to 1.
The NaN entries all belong to the const column: a constant has zero variance, so its correlation with anything is undefined. The diagonal entries for the actual variables are all 1, since every variable is perfectly correlated with itself; neither the NaNs nor the diagonal add information to the analysis.
As can be seen, some pairs of variables are strongly correlated, such as length and wheel_base (0.879), width and length (0.858), and curb_weight and engine_size (0.858); the strongest pairs are city_mpg and highway_mpg (0.972) and comprassion and fuel_type_gas (-0.985), consistent with the large VIFs found above. To avoid multicollinearity in the model, it is preferable to drop some of these variables, since they carry largely redundant information.
from scipy.stats import boxcox
y1= df2.iloc[:, 1].values
y,fitted_lambda= boxcox(y1,lmbda=None)
fitted_lambda
0.29681938324379553
The boxcox function estimates the transformation parameter lambda by maximum likelihood estimation (MLE), choosing the power transform that brings the input variable y1 as close to a normal distribution as possible.
The fitted lambda is approximately 0.297, meaning a power transformation of roughly y^0.3 (close to a cube root) is suggested to normalize the distribution of the target.
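The Box-Cox transform itself is (y^lambda - 1)/lambda for lambda != 0 (and log y for lambda = 0), and it is invertible, which matters when mapping predictions back to the price scale. A small sketch using the fitted lambda reported above (the example prices are made up for illustration):

```python
import numpy as np

def boxcox_transform(y, lam):
    """Box-Cox: (y**lam - 1) / lam for lam != 0, log(y) for lam == 0 (requires y > 0)."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y**lam - 1) / lam

def boxcox_inverse(z, lam):
    """Map transformed values back to the original scale."""
    z = np.asarray(z, dtype=float)
    return np.exp(z) if lam == 0 else (lam * z + 1) ** (1 / lam)

lam = 0.29681938324379553                           # fitted lambda reported above
prices = np.array([5000., 10000., 20000., 40000.])  # made-up example prices
z = boxcox_transform(prices, lam)

print(np.allclose(boxcox_inverse(z, lam), prices))  # True: the transform round-trips
```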
y=y.reshape(-1,1)
model4 = sm.OLS(y,X).fit()
model4.summary()
| Dep. Variable: | y | R-squared: | 0.999 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.999 |
| Method: | Least Squares | F-statistic: | 2.346e+04 |
| Date: | Sun, 30 Apr 2023 | Prob (F-statistic): | 2.70e-285 |
| Time: | 04:17:14 | Log-Likelihood: | 671.49 |
| No. Observations: | 195 | AIC: | -1313. |
| Df Residuals: | 180 | BIC: | -1264. |
| Df Model: | 14 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 7.6784 | 0.047 | 162.369 | 0.000 | 7.585 | 7.772 |
| wheel_base | -0.0015 | 0.000 | -5.788 | 0.000 | -0.002 | -0.001 |
| length | 0.0275 | 0.000 | 182.699 | 0.000 | 0.027 | 0.028 |
| width | -0.0011 | 0.001 | -1.662 | 0.098 | -0.002 | 0.000 |
| heights | -0.0003 | 0.000 | -0.696 | 0.487 | -0.001 | 0.000 |
| curb_weight | -1.387e-05 | 4.47e-06 | -3.105 | 0.002 | -2.27e-05 | -5.05e-06 |
| engine_size | -0.0001 | 4.14e-05 | -3.060 | 0.003 | -0.000 | -4.5e-05 |
| bore | 0.0077 | 0.003 | 2.468 | 0.015 | 0.002 | 0.014 |
| stroke | 0.0072 | 0.002 | 3.211 | 0.002 | 0.003 | 0.012 |
| comprassion | 0.0030 | 0.001 | 2.580 | 0.011 | 0.001 | 0.005 |
| horse_power | -3.615e-06 | 4.64e-05 | -0.078 | 0.938 | -9.52e-05 | 8.79e-05 |
| peak_rpm | -7.102e-06 | 1.82e-06 | -3.897 | 0.000 | -1.07e-05 | -3.51e-06 |
| city_mpg | -0.0021 | 0.000 | -4.498 | 0.000 | -0.003 | -0.001 |
| highway_mpg | 0.0006 | 0.000 | 1.356 | 0.177 | -0.000 | 0.001 |
| fuel_type_gas | 0.0285 | 0.015 | 1.848 | 0.066 | -0.002 | 0.059 |
| Omnibus: | 72.163 | Durbin-Watson: | 1.407 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 242.816 |
| Skew: | -1.491 | Prob(JB): | 1.88e-53 |
| Kurtosis: | 7.582 | Cond. No. | 4.77e+05 |
The output displays the findings of a linear regression analysis between the dependent variables (features) in X and the target variable (y). The target variable is altered with the Box-Cox method to make it normally distributed before the model is fitted.
With an R-squared of 0.999, the model explains 99.9% of the variation in the transformed target, and the adjusted R-squared is also 0.999. Note, however, that this R-squared is measured on the Box-Cox scale, so it is not directly comparable to the R-squared of the models fitted on raw price; such a dramatic jump is worth double-checking rather than taking at face value.
The whole model appears to be statistically significant, and at least one of its independent variables is significantly related to the target variable, according to the F-statistic of 2.346e+04 and the corresponding p-value of 2.70e-285.
The coefficient estimates for each independent variable indicate the direction and strength of their association with the target variable. The p-values associated with each coefficient estimate test the null hypothesis that the true coefficient value is zero. If the p-value is less than 0.05, we can reject the null hypothesis and conclude that the corresponding independent variable is significantly related to the target variable.
The Durbin-Watson statistic of 1.407 is below 2, suggesting some positive autocorrelation remains in the residuals. The Jarque-Bera (JB) statistic tests the null hypothesis that the residuals are normally distributed; since its p-value is far below 0.05, that null is rejected and the residuals are not normally distributed even after the transformation.
There may be some multicollinearity among the independent variables, as shown by the large condition number (4.77e+05), which could impact the interpretation of the model and result in unstable coefficient estimations.